extensions are interesting also from the point of view of the approximation of multivariate functions. The first extension corresponds to dealing with outliers among the sparse data. The second one corresponds to exploiting information about points or regions in the range of the function that are forbidden.
1 Introduction

In previous papers (Poggio and Girosi, 1989, 1990) we have shown the equivalence between regularization and a class of three-layer networks that we called regularization networks and that are related to the classical interpolation technique of Radial Basis Functions.

Let $g = \{(x_i, y_i) \in R^n \times R\}_{i=1}^{N}$ be a set of data that we want to approximate by means of a function $f$. The regularization approach (Tikhonov, 1963; Tikhonov and Arsenin, 1977; Morozov, 1984; Bertero, 1986) selects the function $f$ that solves the variational problem of minimizing the functional

$$H[f] = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \|Pf\|^2 \qquad (1)$$

where $P$ is a constraint operator (usually a differential operator), $\| \cdot \|$ is a norm on the function space to which $Pf$ belongs (usually the $L^2$ norm) and $\lambda$ is a positive real number, the so-called regularization parameter. The structure of the operator $P$, which is called the “stabilizer”, embodies the a priori knowledge about the solution, and therefore depends on the nature of the particular problem that has to be solved. We have shown (Poggio and Girosi, 1989) that the solution of the variational problem (1) has the following simple form:


$$f(x) = \sum_{i=1}^{N} c_i G(x; x_i) + p(x)$$

where $G(x)$ is the Green’s function (Stakgold, 1979) of the self-adjoint differential operator $\hat{P}P$, $\hat{P}$ being the adjoint operator of $P$, $p(x)$ is a linear combination of functions that span the null space of $P$, and the coefficients $c_i$ satisfy a linear system of equations that depends on the $N$ “examples”, i.e. the data to be approximated. The form of the term $p(x)$ depends on the stabilizer that has been chosen and on the boundary conditions, and therefore on the particular problem that has to be solved (for instance, it is not needed in the case of $P$ corresponding to a Gaussian or bell-shaped Green’s function). For this reason, and since its inclusion does not modify the main conclusions, we will disregard it in the following. In the special case in which $P$ is an operator with radial symmetry, the Green’s function $G$ is radial and therefore the approximating function becomes:

$$f(x) = \sum_{i=1}^{N} c_i G(\|x - x_i\|^2), \qquad (2)$$

which is a sum of radial functions, each with its center $x_i$ on a distinct data point. Thus the number of radial functions, and corresponding centers, is the same as the number of examples.
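As an illustration of the expansion (2), the following minimal sketch evaluates a sum of radial functions centered on the data points. The Gaussian form of $G$, the width `sigma` and the function names are assumptions made only for this example; the coefficients are supplied by hand rather than computed from data.

```python
import math

def gaussian_green(r2, sigma=0.3):
    """Radial Gaussian G(r^2) = exp(-r^2 / (2 sigma^2)); sigma is illustrative."""
    return math.exp(-r2 / (2.0 * sigma ** 2))

def rbf_expansion(x, centers, coeffs, sigma=0.3):
    """Evaluate f(x) = sum_i c_i G(||x - x_i||^2) for points given as tuples."""
    total = 0.0
    for xi, ci in zip(centers, coeffs):
        r2 = sum((a - b) ** 2 for a, b in zip(x, xi))  # squared distance to center
        total += ci * gaussian_green(r2, sigma)
    return total

# One radial unit per example: three centers, three coefficients.
centers = [(-0.5,), (0.0,), (0.5,)]
coeffs = [1.0, 2.0, -1.0]
print(rbf_expansion((0.0,), centers, coeffs))
```

Each center contributes one term, so the cost of evaluating $f$ grows linearly with the number of examples.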

In this note we indicate how to extend our theory of learning from examples in order to deal with 1) the occurrence of unreliable examples, 2) negative examples. Both problems are also interesting from the point of view of classical approximation theory:

1. discounting “bad” examples corresponds to discarding, in the approximation of a function, data points that are outliers.

2.  learning  by  using  negative  examples  -  in  addition  to  positive  ones  - 
corresponds  to  approximating  a  function  based  not  only  on  points  to 
which  the  function  must  be  close  but  also  on  points  -  or  regions  -  that 
the  curve  associated  with  the  function  must  avoid. 

2  Unreliable  data 

Suppose that the set $g = \{(x_i, y_i) \in R^n \times R\}_{i=1}^{N}$ of data has been obtained by randomly sampling a function $f$, defined on $R^n$, in the presence of noise. We are interested in recovering the function $f$, or an estimate of it, from the set of data $g$. We take a probabilistic approach, and regard the function $f$ and the data $g$ as random, dependent variables. Using Bayes theorem, it is possible to express the conditional probability $\mathcal{P}[f|g]$ of the function $f$ given the examples $g$ in terms of the a priori probability of $f$, $\mathcal{P}[f]$, and the conditional probability of $g$ given $f$, $\mathcal{P}[g|f]$, that is equivalent to a model of the noise:


$$\mathcal{P}[f|g] \propto \mathcal{P}[g|f]\, \mathcal{P}[f]. \qquad (3)$$

If the noise is Gaussian the probability $\mathcal{P}[g|f]$ can be written as:

$$\mathcal{P}[g|f] \propto e^{-\sum_{i=1}^{N} \beta_i (y_i - f(x_i))^2} \qquad (4)$$

where $\beta_i = \frac{1}{2\sigma_i^2}$ and $\sigma_i^2$ is the variance of the noise related to the $i$-th data point. Under some assumptions on the stochastic process $f$ (Marroquin et al., 1987; Geman and Geman, 1984) it is possible to write the a priori probability $\mathcal{P}[f]$ in the following way:


$$\mathcal{P}[f] \propto e^{-\lambda \|Pf\|^2}$$

where $P$ is a constraint operator (usually a differential operator), $\| \cdot \|$ is a norm on the function space to which $Pf$ belongs (usually the $L^2$ norm) and $\lambda$ is a positive real number. This form of probability distribution gives high probability only to those functions for which the term $\|Pf\|^2$ is small, and embodies the a priori knowledge that one has about the system. For example, if one knows that the function $f$ that has been sampled is very smooth, in the sense that it does not vary too “quickly” in its domain, the operator $P$ will be a differential operator of high degree.

Using Bayes theorem (3) the a posteriori probability of $f$ can be written as

$$\mathcal{P}[f|g] \propto e^{-\left(\sum_{i=1}^{N} \beta_i (y_i - f(x_i))^2 + \lambda \|Pf\|^2\right)}. \qquad (5)$$

A simple way to obtain an estimate of the function $f$ from the probability distribution (5) consists in taking the so-called MAP (Maximum A Posteriori) estimate, that is the function that maximizes the a posteriori probability $\mathcal{P}[f|g]$, or minimizes the exponent in equation (5). Setting for simplicity all the variances $\sigma_i^2$ equal to one fixed value $\sigma^2$ (so that $\beta_i = \beta$ for all $i$), and defining from here on

$$\Delta_i = y_i - f(x_i),$$

the MAP estimate of $f$ is then the minimum of the following functional:

$$H_0[f] = \beta \sum_{i=1}^{N} V(\Delta_i) + \lambda \|Pf\|^2 \qquad (6)$$

where we have defined the quadratic function

$$V(x) = x^2.$$
This is equivalent to the so-called “regularization technique” (Tikhonov, 1963; Tikhonov and Arsenin, 1977; Morozov, 1984; Bertero, 1986) that has been extensively used in order to solve ill-posed problems, of which this is a particular example. The parameter $\lambda$, that is usually called the “regularization parameter”, determines the trade-off between the level of the noise and the strength of the assumptions about the solution, therefore controlling the compromise between the degree of smoothness of the solution and its closeness to the data.

In the approach outlined here we have assumed that the variance of the noise associated with each data point is known, but this assumption is not always realistic. Sometimes we know that some of the data can be affected by a high amount of noise, or can be completely wrong. In order to deal with this situation, we regard the variances of the noise, as well as the unknown function, as random variables. Of course, some a priori knowledge about these variables, represented by an appropriate a priori probability distribution, is needed. Let us denote by $\beta$ the set of random variables $\{\beta_i\}_{i=1}^{N}$. By means of Bayes theorem we can compute the joint probability of the function $f$ and of the set $\beta$:

$$\mathcal{P}[f, \beta | g] \propto \mathcal{P}[g | f, \beta]\, \mathcal{P}[f]\, \mathcal{P}[\beta] \qquad (7)$$

where $\mathcal{P}[g|f, \beta]$ is the same as in equation (4) and $\mathcal{P}[\beta]$ is the a priori probability of the set of variances $\beta$. The model above, that leads to standard regularization, is recovered by setting

$$\mathcal{P}[\beta] = \prod_{i=1}^{N} \delta(\beta_i - \beta_i^*)$$

where $\beta_i^*$ are some fixed values. Depending on the a priori knowledge on $\beta$ different models may arise, corresponding to different choices of $\mathcal{P}[\beta]$. Here we consider the following situation: we know that a certain fraction, $\epsilon$, of the data is spurious (we will call these points “outliers”) whereas a fraction $(1 - \epsilon)$ is characterized by a Gaussian noise distribution with inverse-variance parameter $\beta^*$. Therefore there are only two possibilities: $\beta_i = \beta^*$, for the “true” data points, and $\beta_i = 0$, for the outliers. This situation leads to choosing the following probability distribution:

$$\mathcal{P}[\beta] = \prod_{i=1}^{N} \left[(1 - \epsilon)\, \delta(\beta_i - \beta^*) + \epsilon\, \delta(\beta_i)\right]. \qquad (8)$$

Given the a posteriori probability (7) we are mainly interested in computing an estimate of $f$. Thus what we really need to compute is the marginal posterior probability of $f$, $\mathcal{P}_m[f]$, that is obtained by integrating equation (7) over the variables $\beta_i$:

$$\mathcal{P}_m[f] \propto \int_0^\infty \prod_{i=1}^{N} d\beta_i\, \mathcal{P}[f, \beta | g].$$

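Using the model for $\mathcal{P}[\beta]$ described by equation (8) we obtain:

$$\mathcal{P}_m[f] \propto e^{-\lambda \|Pf\|^2} \prod_{i=1}^{N} \int_0^\infty dx\, e^{-x \Delta_i^2} \left[(1 - \epsilon)\, \delta(x - \beta^*) + \epsilon\, \delta(x)\right].$$

The integral yields

$$\mathcal{P}_m[f] \propto e^{-\lambda \|Pf\|^2} \prod_{i=1}^{N} \left[\frac{1 - \epsilon}{\epsilon}\, e^{-\beta^* \Delta_i^2} + 1\right].$$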
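The integration over each variance variable can be checked one factor at a time. The short derivation below is a sketch with $\Delta_i = y_i - f(x_i)$; it assumes the convention that a delta function sitting at the endpoint $x = 0$ of the integration range contributes its full weight.

```latex
\int_0^\infty dx\, e^{-x \Delta_i^2}
  \left[(1-\epsilon)\,\delta(x-\beta^*) + \epsilon\,\delta(x)\right]
  = (1-\epsilon)\, e^{-\beta^* \Delta_i^2} + \epsilon
  = \epsilon \left[\frac{1-\epsilon}{\epsilon}\, e^{-\beta^* \Delta_i^2} + 1\right]
```

Taking the product over the $N$ data points produces an overall constant $\epsilon^N$, which is absorbed by the proportionality sign.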


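In order to make clear the meaning of such a marginal probability distribution we rewrite it as:

$$\mathcal{P}_m[f] \propto e^{-\left(\beta^* \sum_{i=1}^{N} V_{eff}(\Delta_i) + \lambda \|Pf\|^2\right)}$$

where we have defined the effective potential

$$V_{eff}(x) = \gamma - \frac{1}{\beta^*} \ln\left(1 + e^{\beta^* (\gamma - x^2)}\right)$$

and we have set $\gamma = \frac{1}{\beta^*} \ln \frac{1 - \epsilon}{\epsilon}$. The MAP estimate for $f$ given by this probability distribution is obtained by minimizing the functional

$$H_\epsilon[f] = \beta^* \sum_{i=1}^{N} V_{eff}(\Delta_i) + \lambda \|Pf\|^2. \qquad (9)$$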

The introduction of the random variables $\beta_i$ leads, therefore, to a new minimization problem. Let us compare the functionals (9) and (6). The functional (9) is similar to the standard regularization functional (6), the only difference being in the data term. In the standard regularization functional the data term consists of the sum over all the data of a quadratic function $V$ of the interpolation error $\Delta_i$, and its role is to enforce closeness of the solution to the data. In the present case the quadratic function $V$ has been replaced by the function $V_{eff}$, depicted in figures (1) and (2), whose shape depends on the parameters $\beta^*$ and $\epsilon$.

Figure (1) shows the effective potential for different values of $\epsilon$, and for $\beta^* = 1.0$. In the case of $\epsilon = 0$ we obviously recover the regularization model, since

$$\lim_{\epsilon \to 0} V_{eff}(x) = V(x) = x^2.$$

When $\epsilon$ is different from zero $V_{eff}(x)$ has two different behaviours: quadratic in a neighborhood of the origin, and constant far away from it. The effect of this behavior is clear: closeness to the data is enforced only when the interpolation error is small. In particular we notice that, for $x \to 0$,

$$V_{eff}(x) - V_{eff}(0) \simeq (1 - \epsilon)\, x^2.$$

When $\epsilon$ increases and approaches 1 the effective potential becomes flatter and flatter, which is equivalent to the effective variance of the noise becoming larger and larger.

Let us consider the case of positive values of $\gamma$, that corresponds to values of $\epsilon$ smaller than 0.5. This is the usual case, since $\epsilon$ represents the fraction of outliers, which is normally smaller than the fraction of “true” data points. In the limit of $\beta^* \to \infty$ the effective potential $V_{eff}$ is quadratic if the absolute value of its argument is smaller than $\sqrt{\gamma}$ and constant otherwise (fig. 2). This corresponds to the situation in which we have “true” data points without noise: therefore data points are considered reliable if the interpolation error is smaller than a threshold ($\sqrt{\gamma}$) and their contribution is neglected otherwise. In the case of negative values of $\gamma$, which is the case of a percentage of outliers greater than 50%, the effective potential, that is already flat, becomes even flatter when $\beta^*$ increases. This case is not very interesting and in the following we will always make the assumption that $\gamma > 0$, that is $\epsilon < 0.5$.
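The two regimes of the effective potential can be explored numerically. The sketch below is not taken from the original simulations: it assumes the closed form $V_{eff}(x) = \gamma - \frac{1}{\beta^*}\ln(1 + e^{\beta^*(\gamma - x^2)})$ with $\gamma = \frac{1}{\beta^*}\ln\frac{1-\epsilon}{\epsilon}$, which is one normalization consistent with the marginal distribution derived above, and the function names are hypothetical.

```python
import math

def softplus(z):
    """Numerically stable ln(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def gamma_from(eps, beta_star):
    """gamma = (1/beta*) ln((1 - eps)/eps); positive exactly when eps < 0.5."""
    return math.log((1.0 - eps) / eps) / beta_star

def v_eff(x, beta_star, gamma):
    """Assumed effective potential: about x^2 below sqrt(gamma), about gamma beyond."""
    return gamma - softplus(beta_star * (gamma - x * x)) / beta_star

# As beta* grows the potential approaches a parabola truncated at sqrt(gamma) = 1.
for beta_star in (6.0, 60.0, 600.0):
    row = [round(v_eff(x, beta_star, 1.0), 3) for x in (0.0, 0.5, 1.0, 2.0)]
    print(beta_star, row)
```

Writing the logarithm through `softplus` avoids overflow for large elongations, where the exponential in the direct formula would otherwise be evaluated at a very large argument.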

The standard regularization functional and the functional (9) admit a simple physical interpretation. Let us consider for simplicity a function defined on a one-dimensional lattice. The value of the function $f(x_i)$ at site $i$ is regarded as the position of a particle that can move only in the vertical direction. The particle is connected by a spring to a point that corresponds to the data value $y_i$, and is also connected by springs to some neighboring particles. The size of the neighborhood can vary, but the overall effect is such that the values of the function at neighboring sites tend to be the same. The particle is attracted, with a quadratic potential, by the data point, but it is also attracted by the neighboring particles: the configuration of the system will be the one that minimizes the total energy, depending on the trade-off between these two different effects. The energy of the system corresponds in this scheme to the standard regularization functional: the first term is associated with the springs connecting the particle to the data point, and the second term is associated with the springs connecting neighboring particles, whose role is to enforce smoothness of the final configuration. The stabilizer is represented by the relative strength and the extension of the connections of the particles at neighboring sites: a stabilizer of high degree corresponds to a system in which a particle at a site is connected to particles at sites very far away.

The functional (9) admits a similar interpretation, the only difference being the kind of springs that connect the function value to the data point: in this case the potential energy of these springs is not quadratic anymore, that is the force associated with each spring does not grow linearly with its elongation. The potential energy becomes constant when the elongation is larger than the threshold $\sqrt{\gamma}$, and the force (that is proportional to the first derivative of the potential energy) goes to zero. In a sense these springs break if we try to stretch them too much.
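The breaking of a spring can be made concrete by tabulating the force as a function of the elongation. This is a minimal sketch, assuming that the force is the derivative $2x / (1 + e^{-\beta^*(\gamma - x^2)})$ of the effective potential and that $\gamma = \frac{1}{\beta^*}\ln\frac{1-\epsilon}{\epsilon}$; the name `spring_force` is hypothetical, and the parameter values are those quoted for the experiments.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^-z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def spring_force(elongation, beta_star, gamma):
    """Assumed V'_eff: close to a linear spring at first, switched off past sqrt(gamma)."""
    return 2.0 * elongation * sigmoid(beta_star * (gamma - elongation ** 2))

beta_star, eps = 6.0, 0.1                        # values quoted in the experiments
gamma = math.log((1.0 - eps) / eps) / beta_star  # assumed definition of gamma
for x in (0.1, 0.5, 1.0, 2.0, 3.0):
    print(x, spring_force(x, beta_star, gamma))
```

For small elongations the force is close to that of a linear spring scaled by $(1 - \epsilon)$, and it decays rapidly once the squared elongation exceeds $\gamma$.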

3  Negative  examples 

As we have seen in the previous section, standard regularization admits an interpretation in terms of linear springs, whereas regularization in the presence of unreliable data needs an interpretation in terms of nonlinear springs, that break when the elongation is too large. Nonlinear springs have also been used to deal with discontinuities (Geiger and Girosi, 1989, 1990; Blake and Zisserman, 1987), and we show now another case in which they are very useful.

In many situations, a further source of information about the function may consist of knowing that its value at some given point has to be far from a given value (which, in this context, can be seen as a “negative example”). We shall account for the presence of negative examples by introducing a quadratic repulsive term (a sort of “repulsive” spring) in the regularization functional, one for each negative example (for a related trick, see Kass et al., 1987). However the introduction of such a term might make the regularization functional unbounded from below, because the repulsive spring will tend to push the value of the function up to infinity. The simplest way to prevent this occurrence is to allow the spring constant to decrease with increasing elongation, or in the extreme case, to break at some point. We can use the same model of nonlinear spring of the previous section, and just reverse the sign of the associated potential (see figure (3)).

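If $\{(t_\alpha, y_\alpha) \in R^n \times R\}$ is the set of negative examples, indexed by $\alpha$, and if we define

$$\Delta_\alpha = y_\alpha - f(t_\alpha)$$

the regularizing functional can be written as

$$H[f] = \sum_{i=1}^{N} V(\Delta_i) - \sum_{\alpha} V_{eff}(\Delta_\alpha) + \lambda \|Pf\|^2.$$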
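The competition between attractive and repulsive springs can be sketched by evaluating the data part of this functional for two candidate functions. The code below leaves out the smoothness term, and it assumes the saturating form $V_{eff}(x) = \gamma - \frac{1}{\beta^*}\ln(1 + e^{\beta^*(\gamma - x^2)})$ for the repulsive spring; the toy examples and the names `data_energy`, `f_touch` and `f_avoid` are hypothetical.

```python
import math

def softplus(z):
    """Numerically stable ln(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def v_eff(x, beta_star, gamma):
    """Assumed saturating potential: about x^2 near 0, about gamma far away."""
    return gamma - softplus(beta_star * (gamma - x * x)) / beta_star

def data_energy(f, positives, negatives, beta_star, gamma):
    """sum_i V(Delta_i) - sum_alpha V_eff(Delta_alpha), without the stabilizer."""
    attract = sum((y - f(x)) ** 2 for x, y in positives)
    repel = sum(v_eff(y - f(t), beta_star, gamma) for t, y in negatives)
    return attract - repel

positives = [(0.0, 0.0)]       # the function should pass near (0, 0)
negatives = [(1.0, 0.0)]       # and stay away from (1, 0)
f_touch = lambda x: 0.0        # goes straight through the negative example
f_avoid = lambda x: 5.0 * x    # fits the positive one and avoids the negative one

print(data_energy(f_touch, positives, negatives, 50.0, 1.0))
print(data_energy(f_avoid, positives, negatives, 50.0, 1.0))
```

Because the repulsive potential saturates at $\gamma$, moving further away from the negative example brings no additional reward, which is what keeps the functional bounded from below.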

4  Solution  of  the  variational  problem 

In this section we discuss the solution of the variational problem associated with the regularizing functionals of the previous sections. Since the cases of unreliable data and of negative examples are formally similar we will derive the equations only in the case of unreliable data. The functional to be minimized is

$$H_\epsilon[f] = \beta^* \sum_{i=1}^{N} V_{eff}(\Delta_i) + \lambda \|Pf\|^2, \qquad (10)$$

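and the Euler-Lagrange equations for this functional have the form:

$$\hat{P}P f(x) = \frac{\beta^*}{2\lambda} \sum_{i=1}^{N} V'_{eff}(\Delta_i)\, \delta(x - x_i) \qquad (11)$$

where $V'_{eff}(x)$ is the first derivative of $V_{eff}(x)$, that is

$$V'_{eff}(x) = \frac{2x}{1 + e^{-\beta^* (\gamma - x^2)}}.$$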

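We notice that in the limit of $\epsilon \to 0$, that is in the case of springs that never break, $\gamma$ goes to infinity and $V'_{eff}(x) \to 2x$. In this case the standard regularization equations

$$\hat{P}P f(x) = \frac{\beta^*}{\lambda} \sum_{i=1}^{N} (y_i - f(x_i))\, \delta(x - x_i) \qquad (12)$$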

are recovered. Equation (11) has the same structure as that associated with the standard regularization case, and the solution can be derived using the Green’s function technique (Stakgold, 1979). As in the standard regularization case (Poggio and Girosi, 1989), the solution will be a linear superposition of Green’s functions, one for each data point:

$$f(x) = \sum_{i=1}^{N} c_i G(x; x_i) \qquad (13)$$

where, in the general case,

$$c_i = \frac{\beta^* V'_{eff}(\Delta_i)}{2\lambda}.$$

We  notice  however  that  expression  (13)  is  not  the  complete  solution  of 
the  minimization  problem.  In  fact  all  the  functions  that  lie  in  the  null  space 
of  the  operator  P  are  “invisible”  to  the  smoothing  term  in  the  functional 
(10),  so  that  the  previous  expansion  is  the  solution  modulo  a  term  that  lies 
in  the  null  space  of  P.  According  to  the  considerations  contained  in  section 
1,  in  the  following  we  will  drop  it  from  equations. 

In order to find the vector $c$ of coefficients $c_i$ we substitute the expansion of equation (13) in the functional $H_\epsilon[f]$ defined in equation (10), that becomes a function $H^*(c)$ of the coefficients. Thus the vector $c$ minimizes the function $H^*(c)$, which leads to the following set of equations:

$$\frac{\partial H^*(c)}{\partial c_i} = 0, \qquad i = 1, \ldots, N \qquad (14)$$

Gradient descent is probably the simplest approach for attempting to find the solution to this minimization problem, though, of course, it is not guaranteed to converge. Several other iterative methods, such as versions of conjugate gradient and simulated annealing, may be more appropriate than gradient descent, and their use is recommended. In the gradient descent method the vector $c$ that minimizes $H^*(c)$ is regarded as the stable fixed point of the following dynamical system:

$$\dot{c} = -\omega \frac{\partial H^*(c)}{\partial c} \qquad (15)$$

where $\omega$ is a parameter determining the microscopic timescale of the problem and is related to the rate of convergence to the fixed point.

We consider for simplicity the case of positive definite Green’s functions, that do not require any additional term in eq. (13). In this case it has been shown (Poggio and Girosi, 1989) that, with natural boundary conditions, we can write

$$\|Pf\|^2 = c \cdot G c,$$

where $G$ is the symmetric matrix $(G)_{ij} = G(x_i; x_j)$, its symmetry coming from the fact that the operator $\hat{P}P$ is self-adjoint.

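Equations (15) have then the following form:

$$\dot{c} = -\omega \frac{\partial}{\partial c} \left[\beta^* \sum_{i=1}^{N} V_{eff}(\Delta_i) + \lambda\, c \cdot G c\right]$$

that, defining

$$E_i = \frac{1}{1 + e^{-\beta^* (\gamma - \Delta_i^2)}}, \qquad (E)_{ij} = E_i\, \delta_{ij},$$

can be written as

$$\dot{c} = -2\omega G \left[(\beta^* E G + \lambda I)\, c - \beta^* E y\right] \qquad (16)$$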

where $I$ is the identity matrix. The vector $c$ that minimizes $H^*(c)$ has then to satisfy the following set of nonlinear equations:

$$(\beta^* E G + \lambda I)\, c = \beta^* E y, \qquad (17)$$

the nonlinearity being contained in the matrix $E$, that is a nonlinear function of the unknowns. Notice that

$$\lim_{\epsilon \to 0} E = I$$

and in this case the linear standard equations are recovered (Poggio and Girosi, 1989). The main implication of the nonlinearity is that the solution of these equations is not unique anymore, the different solutions corresponding to the local minima of the functional (10). Notice that it is straightforward to modify the previous gradient descent equations in order to take into account negative examples.
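The gradient flow (16) and its fixed-point condition (17) can be sketched with a plain Euler discretization. The code below is not the original Common Lisp implementation: the fixed step size, the three well-separated sample points, the Gaussian Green's function of width `sigma`, the choice $\gamma = \frac{1}{\beta^*}\ln\frac{1-\epsilon}{\epsilon}$ and all function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gram(xs, sigma):
    """Matrix (G)_ij = G(x_i; x_j) for a Gaussian Green's function."""
    return [[math.exp(-(a - b) ** 2 / (2.0 * sigma ** 2)) for b in xs] for a in xs]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def residual(c, G, y, beta_star, gamma, lam):
    """(beta* E G + lam I) c - beta* E y, the bracket of eq. (16)."""
    f = matvec(G, c)                      # f(x_i) = (G c)_i
    out = []
    for ci, fi, yi in zip(c, f, y):
        e_i = sigmoid(beta_star * (gamma - (yi - fi) ** 2))  # diagonal entry of E
        out.append(beta_star * e_i * fi + lam * ci - beta_star * e_i * yi)
    return out

def descend(G, y, beta_star, gamma, lam, c0, step=0.05, iters=500):
    """Euler steps of c' = -2 G [(beta* E G + lam I) c - beta* E y]."""
    c = list(c0)
    for _ in range(iters):
        grad = [2.0 * g for g in matvec(G, residual(c, G, y, beta_star, gamma, lam))]
        c = [ci - step * gi for ci, gi in zip(c, grad)]
    return c

xs = [-1.0, 0.0, 1.0]
y = [math.cos(x) for x in xs]
beta_star, eps, lam = 6.0, 0.1, 0.01
gamma = math.log((1.0 - eps) / eps) / beta_star
G = gram(xs, 0.3)
c = descend(G, y, beta_star, gamma, lam, c0=y)
print(max(abs(r) for r in residual(c, G, y, beta_star, gamma, lam)))
```

The matrix $E$ is recomputed from the current interpolation errors at every step, which is where the nonlinearity enters; different starting points `c0` may therefore end in different fixed points of (17).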


5  Experimental  Results 

In this section we describe some results that we obtained applying these techniques to very simple one-dimensional problems. We first discuss an example with unreliable data, and then a problem with negative examples. We used a gradient descent algorithm with adaptive step, running on a SUN 4 workstation. The code for these simulations was written in Common Lisp, and in all the examples described below, the time required for 100 iterations of the gradient descent algorithm was about 30 seconds. In the following figures data points are represented by large dots.

5.1  Unreliable  data 

We approximate the function $f(x) = \cos(x)$ in the interval $[-1, 1]$. The data set consisted of seven examples, randomly chosen from the graph of $f$. In order to create an outlier in the data set, we substituted the value of the fourth point with the value 1.5, that is 50% larger than the largest value of the other data points. The Green’s function of the problem was a Gaussian of variance $\sigma = 0.3$, the parameter $\epsilon$ was set to 0.1, and the parameter $\beta^*$ was set to 6. With these values of $\epsilon$ and $\beta^*$ the effective potential was approximately constant for values of its argument larger than 1. In figure (4a) we show the result obtained by applying standard regularization theory to approximate the data set. The value of the regularization parameter $\lambda$ is $10^{-2}$, and the result obtained after 200 iterations of the gradient descent algorithm is shown. The solution, that almost interpolates the data set, hardly resembles a cosine function, due to the outlier. If the springs are allowed to break, we obtain the result shown in figure (4b), after only 10 iterations of gradient descent: the spring of the outlier breaks, and the solution is not influenced by the outlier. Since the variance of the Gaussian Green’s function is small ($\sigma = 0.3$) the solution has a “hole” in correspondence of the outlier, because there are no data there. A similar situation is shown in figures (4c) and (4d), the only difference being the value of the regularization parameter, that is ten times larger, that is $\lambda = 10^{-1}$. We notice that, since the Green’s function is bounded, increasing the regularization parameter has the effect of decreasing the norm of the solution (see Poggio and Girosi, 1989). This effect is evident when comparing figures (4a) and (4b) with figures (4c) and (4d).
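The standard regularization baseline of this experiment can be sketched by solving the linear system obtained from (17) with $E = I$, that is $(\beta^* G + \lambda I)\,c = \beta^* y$. The sketch is not the original code: it assumes equally spaced sample points instead of random ones, reads $\sigma$ as the width in $e^{-r^2/(2\sigma^2)}$, and uses a direct solver where the experiments used gradient descent; the helper names are hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            t = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= t * M[k][j]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x

def gaussian(r, sigma):
    return math.exp(-r * r / (2.0 * sigma ** 2))

xs = [-1.0 + i / 3.0 for i in range(7)]   # assumed equally spaced, not random
ys = [math.cos(x) for x in xs]
ys[3] = 1.5                               # the outlier replaces the fourth value
sigma, lam, beta_star = 0.3, 1e-2, 6.0

G = [[gaussian(a - b, sigma) for b in xs] for a in xs]
A = [[beta_star * G[i][j] + (lam if i == j else 0.0) for j in range(7)]
     for i in range(7)]
c = solve(A, [beta_star * yv for yv in ys])

def f(x):
    """Regularized approximant f(x) = sum_i c_i G(x; x_i)."""
    return sum(ci * gaussian(x - xi, sigma) for ci, xi in zip(c, xs))

print([round(f(x) - yv, 4) for x, yv in zip(xs, ys)])   # residuals at the data
```

With a small $\lambda$ the residuals stay small at every sample, including the outlier, which is the near-interpolating behaviour described for figure (4a).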

5.2  Negative  examples 

In order to test the negative example technique we choose again, as the function to be approximated, the cosine function $f(x) = \cos(x)$, randomly sampled at seven points in the interval $[-1, 1]$. In all the experiments the regularization parameter was set to zero, its role not being crucial in this case. The Green’s functions we used were always Gaussians, with variance different from case to case. The fourth data point, whose coordinates were $(x, y) = (-0.15, 0.99)$, was selected as the negative example, and the parameters $\beta^*$ and $\epsilon$ were the same as in the previous case, so that the springs could break if the elongation were larger than 1. This meant that the result had to be a function $f^*(x)$ that approximates the six “positive” examples, but such that $|f^*(-0.15) - 0.99| > 1$. There are clearly two possibilities: $(f^*(-0.15) - 0.99) > 1$ and $(0.99 - f^*(-0.15)) > 1$, corresponding to functions “passing above and below the negative example”. These configurations correspond to two different minima of the functional, and we expect to obtain one of these two configurations depending on the initial conditions of the gradient descent algorithm.

Figure 5: Negative examples. (a) and (b): the configurations corresponding to different minima of the functional. (c): the effect of increasing the attractiveness of the standard springs. See text for explanation.

In figures (5a) and (5b) we show two results corresponding to two different local minima. Convergence was reached in 50 iterations, and in both cases the variance of the Gaussian is $\sigma = 0.2$. In figure (5a) we set as initial condition $c_i = y_i$, and in figure (5b) we set $c_i = 0.0$. In the first case the initial condition corresponds to a function that is “above” the data, while in the second case the initial function is zero everywhere, and therefore “below” the data. In the first case the final value of the “energy” of the system was $H = -0.996$, that is very close to the global minimum energy $H = -1.0$, while in the second case the energy was $H = -0.931$. Interpreting these results in terms of springs, it is evident, in figure (5b), that the spring on the left of the negative example is not sufficiently strong to pull the solution up to the datum. We then changed the elastic constant of this spring and of the corresponding one on the right of the negative example, setting their values to 10, that is ten times larger than the other ones. The result is shown in figure (5c), and it is clearly better than the one shown in figure (5b), its associated energy being $H = -0.995$, that is comparable with the value $H = -0.996$ of figure (5a).

From the previous result and many other experiments it is apparent that the energy landscape associated with this minimization problem could be very complicated, with many local minima corresponding to the two types of configurations (“above” and “below”). It is natural to ask whether during the gradient descent iterations the system naturally “jumps” from one of these configurations to the other one. The answer is given in figures (6a) and (6b). In figure (6a) we show the configuration of the system corresponding to the iterations 1, 30, 31 and 40 of the gradient descent algorithm. The variance of the Gaussian Green’s function is $\sigma = 0.8$, and the starting point of the descent procedure is $c_i = 0.0$. At the beginning the configuration is of type “below”, because it is identically zero, and then it stabilizes around an interpolating function until iteration 30. At iteration 30 the system jumps to a configuration of type “above”, whose energy is much lower, and then converges rapidly to a local minimum. In figure (6b) the energy of the system is shown as a function of the number of iterations: notice the jump at iteration 30, that probably corresponds to a discontinuity of the gradient of the energy surface.

In order to escape local minima we used a simple form of stochastic gradient descent, adding a white noise term to eq. (15). The noise term was used only to get out of local minima, that is it was switched on only when the energy decreased, from one iteration to the next, by an amount lower than a small threshold (usually $10^{-8}$). The usefulness of the noise is shown in figures (7a) and (7b). The data are the same as in figures (5) and (6), but the break point of the spring of the negative example has value 1, the variance of the Gaussian Green’s function is $\sigma = 0.4$ and the amplitude of the noise is $10^{-2}$. In figure (7a) we show the result of the gradient descent algorithm without noise. Convergence was obtained after 25 iterations, and the result is not very good, corresponding to some local minimum. In figure (7b) we show the result of the stochastic gradient descent algorithm after 1000 iterations: the local minima have been escaped, and the result is almost a perfect interpolant on the “positive” examples.

Interesting effects take place if we raise the amplitude of the noise. In figures (8a), (8b) and (8c) we show what happens if, in the previous example, we set the amplitude of the noise to $10^{-1}$, instead of $10^{-2}$. The results of the stochastic descent procedure are shown at iterations 200, 500 and 2000. We notice that the system jumps from a configuration of type “below” to a configuration of type “above” and then to a configuration of type “below” again. This suggests that there are several local minima, and the noise makes the system jump from one to another. In figure (8d) the energy of the system as a function of the number of iterations is shown: notice that the algorithm does not inject the noise continuously, but only when the energy stops decreasing.
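The gating of the noise can be sketched on a toy problem. The code below does not use the functional of the experiments: it assumes a one-dimensional tilted double-well energy as a stand-in for a landscape with a local and a global minimum, and the names `gated_descent`, `amplitude` and `threshold` are hypothetical. Whether a given noisy run actually leaves the local minimum depends on the amplitude and on the random seed.

```python
import math
import random

def energy(c):
    """Toy tilted double well: global minimum near c = -1, local one near c = +1."""
    return (c * c - 1.0) ** 2 + 0.3 * c

def gradient(c):
    return 4.0 * c * (c * c - 1.0) + 0.3

def gated_descent(c0, step=0.05, amplitude=0.0, threshold=1e-8, iters=1000, seed=0):
    """Gradient descent that adds white noise only while the energy has stalled."""
    rng = random.Random(seed)
    c, e = c0, energy(c0)
    stalled, kicks = False, 0
    for _ in range(iters):
        c_new = c - step * gradient(c)
        if stalled:                       # noise is switched on only after a stall
            c_new += amplitude * rng.gauss(0.0, 1.0)
            kicks += 1
        e_new = energy(c_new)
        stalled = (e - e_new) < threshold  # energy decrease below the threshold
        c, e = c_new, e_new
    return c, kicks

print(gated_descent(1.2))                  # plain descent settles in the local well
print(gated_descent(1.2, amplitude=0.1))   # same start, with gated noise
```

The counter `kicks` records how many iterations received noise, so a run can be inspected to confirm that the noise was injected only after the energy stopped decreasing.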
6 Remarks


1. The first extension we have introduced here - to deal with unreliable data - may be important in problems such as surface reconstruction, as one encounters in computer vision. It may or may not be useful in problems of learning from examples.

2. The second extension - to exploit negative examples - is especially interesting for the problem of learning, where often negative examples are present (though they usually are less important than the positive ones). In some cases it may be useful also in problems of approximation of functions. There are situations in which one knows that certain regions of the range of the function are forbidden. Interestingly, this type of problem seems to have been ignored in the classical approach to function approximation (see Verri and Poggio, 1988 for related, simpler and more classical cases). The functional we considered, and hence the type of spring we used, is open to further modifications, according to the a priori knowledge about the system. For example, the constraint that the values of a one-dimensional function are bounded from above (and/or below) can be included using springs that are negative from one side and positive from the other side.

3. In both the extensions that we have presented the solution has the form (13), which has a simple interpretation in terms of feedforward networks with one layer of hidden units, of the same class as the regularization networks introduced in previous papers (Poggio and Girosi, 1989; 1990).